8 research outputs found
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Visual information is central to conversation: body gestures and physical
behaviour, for example, contribute to meaning that transcends words alone. To
date, however, most neural conversational models are limited to just text. We
introduce CHAMPAGNE, a generative model of conversations that can account for
visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a
large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from
web videos: crucial to our data collection pipeline is a pretrained language
model that converts error-prone automatic transcripts to a cleaner dialogue
format while maintaining meaning. Human evaluation reveals that YTD-18M is more
sensible and specific than prior resources (MMDialog, 1M dialogues), while
maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE
learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it
achieves state-of-the-art results on four vision-language tasks focused on
real-world conversations. We release data, models, and code.
Comment: ICCV 2023, Project page: https://seungjuhan.me/champagn
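The abstract's key pipeline step is a pretrained language model that rewrites noisy automatic transcripts into dialogue form. A minimal sketch of that step, assuming a hypothetical text-to-text callable `lm` (the actual YTD-18M prompt and model are not specified in the abstract):

```python
def transcript_to_dialogue(transcript, lm):
    """Rewrite a noisy ASR transcript as clean speaker-turn dialogue.

    `lm` is a hypothetical text-to-text callable standing in for a real
    pretrained language model API; the prompt below is illustrative only.
    """
    prompt = (
        "Rewrite the following automatic transcript as a clean dialogue, "
        "fixing transcription errors while preserving the meaning. "
        "Format each turn as 'Speaker N: ...'.\n\n"
        f"Transcript: {transcript}\n\nDialogue:"
    )
    return lm(prompt)
```

Any model capable of instruction following could fill the `lm` slot; the point is that cleanup is done by prompting, not by rule-based post-processing.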
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Human values are crucial to human decision-making. Value pluralism is the
view that multiple correct values may be held in tension with one another
(e.g., when considering lying to a friend to protect their feelings, how does
one balance honesty with friendship?). As statistical learners, AI systems fit
to averages by default, washing out these potentially irreducible value
conflicts. To improve AI systems to better reflect value pluralism, the
first-order challenge is to explore the extent to which AI systems can model
pluralistic human values, rights, and duties as well as their interaction.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and
duties connected to 31k human-written situations. ValuePrism's contextualized
values are generated by GPT-4 and deemed high-quality by human annotators 91%
of the time. We conduct a large-scale study with annotators across diverse
social and demographic backgrounds to try to understand whose values are
represented.
With ValuePrism, we build Kaleido, an open, lightweight, and structured
language-based multi-task model that generates, explains, and assesses the
relevance and valence (i.e., support or oppose) of human values, rights, and
duties within a specific context. Humans prefer the sets of values output by
our system over the teacher GPT-4, finding them more accurate and with broader
coverage. In addition, we demonstrate that Kaleido can help explain variability
in human decision-making by outputting contrasting values. Finally, we show
that Kaleido's representations transfer to other philosophical frameworks and
datasets, confirming the benefit of an explicit, modular, and interpretable
approach to value pluralism. We hope that our work will serve as a step to
making more explicit the implicit values behind human decision-making and to
steering AI systems to make decisions that are more in accordance with them.
Self-Refine: Iterative Refinement with Self-Feedback
Like people, LLMs do not always generate the best output for a given
generation problem (e.g., summaries, answers, explanations) on their first
try. Just as
people then refine their text, we introduce SELF-REFINE, a framework for
similarly improving initial outputs from LLMs through iterative feedback and
refinement. The main idea is to generate an output using an LLM, then allow the
same model to provide multi-aspect feedback for its own output; finally, the
same model refines its previously generated output given its own feedback.
Unlike earlier work, our iterative refinement framework does not require
supervised training data or reinforcement learning, and works with a single
LLM. We experiment with 7 diverse tasks, ranging from review rewriting to math
reasoning, demonstrating that our approach outperforms direct generation. In
all tasks, outputs generated with SELF-REFINE are preferred by humans and by
automated metrics over those generated directly with GPT-3.5 and GPT-4,
improving by an absolute 20% on average across tasks.
Comment: Code, data, and demo at https://selfrefine.info
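The generate-feedback-refine loop described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical text-to-text callable `llm` in place of a real model API; the paper's actual prompts are task-specific:

```python
def self_refine(task, llm, max_iters=4, stop="DONE"):
    """Sketch of the SELF-REFINE loop: one model drafts an output,
    critiques its own draft, and revises it using that feedback.
    No supervised training data or RL is involved, and the same
    `llm` callable plays all three roles."""
    output = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        feedback = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Critique this answer, or reply {stop} if it needs no changes:"
        )
        if stop in feedback:
            break  # the model judges its own output good enough
        output = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Feedback: {feedback}\nRevised answer:"
        )
    return output
```

The stop condition and iteration cap are illustrative knobs; the key property is that generator, critic, and refiner are the same frozen model.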
Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning
Large language models excel at a variety of language tasks when prompted with
examples or instructions. Yet controlling these models through prompting alone
is limited. Tailoring language models through fine-tuning (e.g., via
reinforcement learning) can be effective, but it is expensive and requires
model access.
We propose Inference-time Policy Adapters (IPA), which efficiently tailors a
language model such as GPT-3 without fine-tuning it. IPA guides a large base
model during decoding time through a lightweight policy adaptor trained to
optimize an arbitrary user objective with reinforcement learning.
On five challenging text generation tasks, such as toxicity reduction and
open-domain generation, IPA consistently brings significant improvements over
off-the-shelf language models. It outperforms competitive baseline methods,
sometimes even including expensive fine-tuning. In particular, tailoring GPT-2
with IPA can outperform GPT-3, while tailoring GPT-3 with IPA brings a major
performance boost over GPT-3 (and sometimes even over GPT-4). Our promising
results highlight the potential of IPA as a lightweight alternative to
tailoring extreme-scale language models.
Faith and Fate: Limits of Transformers on Compositionality
Transformer large language models (LLMs) have sparked admiration for their
exceptional performance on tasks that demand intricate multi-step reasoning.
Yet, these models simultaneously show failures on surprisingly trivial
problems. This begs the question: Are these errors incidental, or do they
signal more substantial limitations? In an attempt to demystify Transformers,
we investigate the limits of these models across three representative
compositional tasks -- multi-digit multiplication, logic grid puzzles, and a
classic dynamic programming problem. These tasks require breaking problems down
into sub-steps and synthesizing these steps into a precise answer. We formulate
compositional tasks as computation graphs to systematically quantify the level
of complexity, and break down reasoning steps into intermediate sub-procedures.
Our empirical findings suggest that Transformers solve compositional tasks by
reducing multi-step compositional reasoning into linearized subgraph matching,
without necessarily developing systematic problem-solving skills. To round off
our empirical study, we provide theoretical arguments on abstract multi-step
reasoning problems that highlight how Transformers' performance will rapidly
decay with increased task complexity.
Comment: 10 pages + appendix (21 pages)
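The computation-graph framing can be made concrete for multi-digit multiplication. As a rough illustration (not the paper's exact graph construction), the sketch below decomposes long multiplication into digit-level sub-steps and records each one; the number of recorded steps is a crude proxy for graph size, which grows quickly with digit count:

```python
def multiply_with_trace(a, b):
    """Long multiplication decomposed into digit-level sub-steps.

    Each recorded step is one node of the task's computation graph;
    the step count is a simple proxy for compositional complexity.
    Illustrative sketch only, not the paper's exact construction."""
    steps = []
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    partials = []
    for i, db in enumerate(b_digits):
        partial = 0
        for j, da in enumerate(a_digits):
            prod = da * db
            steps.append(("mul", da, db, prod))   # one single-digit product
            partial += prod * 10 ** j
        partials.append(partial * 10 ** i)        # shift by place value
    total = 0
    for p in partials:
        steps.append(("add", total, p, total + p))  # one multi-digit addition
        total += p
    return total, steps
```

For an m-digit by n-digit product this records m*n single-digit multiplications plus n additions, so the graph a model must implicitly traverse grows multiplicatively with operand length.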
Evaluating Open-Domain Question Answering in the Era of Large Language Models
Lexical matching remains the de facto evaluation method for open-domain
question answering (QA). Unfortunately, lexical matching fails completely when
a plausible candidate answer does not appear in the list of gold answers, which
is increasingly the case as we shift from extractive to generative models. The
recent success of large language models (LLMs) for QA aggravates lexical
matching failures since candidate answers become longer, thereby making
matching with the gold answers even more challenging. Without accurate
evaluation, the true progress in open-domain QA remains unknown. In this paper,
we conduct a thorough analysis of various open-domain QA models, including
LLMs, by manually evaluating their answers on a subset of NQ-open, a popular
benchmark. Our assessments reveal that while the true performance of all models
is significantly underestimated, the performance of the InstructGPT (zero-shot)
LLM increases by nearly +60%, making it on par with existing top models, and
the InstructGPT (few-shot) model actually achieves a new state-of-the-art on
NQ-open. We also find that more than 50% of lexical matching failures are
attributed to semantically equivalent answers. We further demonstrate that
regex matching ranks QA models consistent with human judgments, although still
suffering from unnecessary strictness. Finally, we demonstrate that automated
evaluation models are a reasonable surrogate for lexical matching in some
circumstances, but not for long-form answers generated by LLMs. The automated
models struggle in detecting hallucinations in LLM answers and are thus unable
to evaluate LLMs. At this time, there appears to be no substitute for human
evaluation.
Comment: ACL 2023; code and data released at https://github.com/ehsk/OpenQA-eva
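The lexical-matching failure mode described above is easy to reproduce with the standard SQuAD-style exact-match metric. This is a common implementation pattern; the benchmark's exact normalization may differ slightly:

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop punctuation
    and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(candidate, gold_answers):
    """True iff the normalized candidate equals any normalized gold answer."""
    return normalize(candidate) in {normalize(g) for g in gold_answers}
```

Here `exact_match("the Beatles", ["Beatles"])` is true, but a correct long-form answer like "John, Paul, George and Ringo" scores zero against the gold answer "The Beatles": exactly the failure mode that makes lexical matching underestimate generative LLMs.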